Project Overview PR2.20¶
The goal of this project is to apply clustering to meteorological data collected for the provinces of Emilia-Romagna.
The aim of the analysis is to discover which provinces are most similar in terms of air pollutants.
The data consists of several CSV files, one per province.
The main challenge was applying clustering to time series data, which requires more advanced methodologies than classical clustering algorithms.
Setup¶
In this section we show the import procedure and the cleaning steps performed on the dataset.
Each dataset is imported into a dictionary where each key is the province name.
Import¶
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
script_path = os.getcwd()
path = os.path.dirname(script_path)
def import_data(path):
    data = R'data'
    data_path = os.path.join(path, data)
    # List all CSV files in the folder, excluding the metadata file
    csv_files = [file for file in os.listdir(data_path)
                 if file.endswith(".csv") and file != 'InfoComune.csv']
    # Create an empty dictionary to store DataFrames
    dataframes = {}
    # Iterate through each CSV file
    for file in csv_files:
        # Extract the file name without the extension
        df_name = os.path.splitext(file)[0]
        # Create the DataFrame and store it in the dictionary
        dataframes[df_name] = pd.read_csv(os.path.join(data_path, file), header=0, skiprows=[1])
    return dataframes
dataframes = import_data(path)
# store dictionary items in specific variables to make it easier to loop through them
datasets = dataframes.values()
provinces = dataframes.keys()
data_prov_pairs = dataframes.items()
The following variables keep track of the main columns we wish to retain and their positions in the datasets.
# columns for which we keep the average value only
pollutants = ['CO', 'NH3', 'NMVOC', 'NO2', 'NO', 'O3', 'PANS', 'PM10', 'PM2.5', 'SO2']
# meteorological information
met = ['TG', 'TN', 'TX', 'HU', 'PP', 'QQ', 'RR']
met_pos = range(6, 13)
# date values
date = ['YYYY', 'MM', 'DD']
date_pos = list(range(3))
# column groups used throughout the cleaning, plotting and clustering steps
numerics = pollutants + met
selected_columns = date + pollutants + met
Cleaning and filtering¶
def clean_data(data):
    for province, df in data_prov_pairs:
        # change missing values to the proper format
        df.replace('---', np.nan, inplace=True)
        # ensure a consistent dtype for each column
        df = df.convert_dtypes()
        # rename the columns for date and meteorological information
        old_date = df.columns[date_pos]
        old_met = df.columns[met_pos]
        df.rename(columns=dict(zip(old_date, date)), inplace=True)
        df.rename(columns=dict(zip(old_met, met)), inplace=True)
        df = df[selected_columns]
        # Combine 'YYYY', 'MM', 'DD' columns into a new 'date' column
        df['date'] = pd.to_datetime(df[['YYYY', 'MM', 'DD']].astype(str).agg('-'.join, axis=1), format='%Y-%m-%d')
        # Remove 'YYYY', 'MM', 'DD' columns
        df.drop(['YYYY', 'MM', 'DD'], axis=1, inplace=True)
        # Reorder columns with 'date' as the first column
        df = df[['date'] + [col for col in df.columns if col != 'date']]
        # convert the pollutant and meteorological columns to numeric, since they are read as strings
        df[numerics] = df[numerics].apply(pd.to_numeric, errors='coerce')
        # round to the second decimal for better visualization
        df[numerics] = df[numerics].round(2)
        # filter the series to a range without missing values at either end:
        # from 2018-01-01 to 2020-12-28
        df = df[(df['date'] >= pd.to_datetime('2018-01-01')) & (df['date'] <= pd.to_datetime('2020-12-28'))]
        dataframes[province] = df
    return dataframes
dataframes = clean_data(dataframes)
Missing values¶
dataframes['bologna'][dataframes['bologna'].isnull().any(axis=1)]
We find that 2020-02-29 is still missing for some variables.
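To see which columns are affected, a per-column missing-value count can be used; a minimal sketch on a toy frame (the values below are made up, not the real data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for one province's cleaned data (hypothetical values)
df = pd.DataFrame({
    "date": pd.date_range("2020-02-27", periods=4),
    "PM10": [21.0, np.nan, 18.5, 19.0],
    "NO2": [30.1, 28.4, np.nan, 27.9],
    "O3": [55.0, 54.2, 53.8, 52.1],
})

# Count missing values per column to see which variables need imputation
missing_per_column = df.isnull().sum()
print(missing_per_column[missing_per_column > 0])
```

In practice this would be run per province on the `dataframes` dictionary.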
Replacing missing values¶
Since the remaining missing values fall neither at the beginning nor at the end of the series, we imputed them with the median, which is robust to extreme values.
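A quick numeric illustration of that robustness (toy values, not project data): a single outlier drags the mean far off while barely touching the median.

```python
import numpy as np

# Hypothetical pollutant readings with one extreme observation
values = np.array([10.0, 11.0, 12.0, 11.5, 95.0])  # 95.0 is an outlier

print(np.mean(values))    # pulled toward the outlier
print(np.median(values))  # essentially unaffected
```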
def impute_median(data):
    from sklearn.impute import SimpleImputer
    imputer = SimpleImputer(strategy='median')
    for province, df in data_prov_pairs:
        df[numerics] = imputer.fit_transform(df[numerics])
        dataframes[province] = df
    return dataframes
dataframes = impute_median(dataframes)
Smoothing¶
We applied smoothing to the series to reduce the influence of extreme observations, both on clustering and on visualizations.
The Savitzky-Golay filter applies polynomial smoothing over the time span indicated by the window parameter.
Its advantage over other smoothing methods, such as the moving average, is that it does not introduce missing values into the series.
For the variable PP, which we want to plot, we applied stricter parameters because it showed more extreme values than the other variables.
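As a quick illustration of the no-missing-values property (a standalone sketch on synthetic data, not the project's series): the Savitzky-Golay output has the same length as the input with no NaNs, while a centered moving average leaves NaNs at the edges.

```python
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter

# Toy noisy series standing in for one column (synthetic data)
rng = np.random.default_rng(0)
x = np.linspace(0, 4 * np.pi, 200)
series = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# Savitzky-Golay: same length as the input, no NaNs introduced
smoothed = savgol_filter(series, window_length=11, polyorder=2)

# A centered moving average, by contrast, leaves NaNs at both edges
ma = pd.Series(series).rolling(window=11, center=True).mean()
print(smoothed.shape == series.shape)  # True
print(int(ma.isna().sum()))            # 10 edge values are NaN
```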
from scipy.signal import savgol_filter
def smooth(window, window_pp, poly_pp, poly=2):
    for province, df in data_prov_pairs:
        for column in df[numerics].columns:
            # Extract numerical values from the DataFrame
            values = df[column].values
            # for PP we need heavier smoothing due to outliers; use per-column
            # parameters so PP's settings don't carry over to later columns
            if column == 'PP':
                w, p = window_pp, poly_pp
            else:
                w, p = window, poly
            # Apply Savitzky-Golay filter to the numerical values
            smoothed_values = savgol_filter(values, w, p)
            # Update the DataFrame with the smoothed values
            df[column] = smoothed_values
        dataframes[province] = df
smooth(window = 10, window_pp = 25, poly_pp = 5)
Concatenating¶
Concatenation was required to produce plots and analyses by group.
During concatenation, a group variable is appended to each series to indicate membership.
The generality of the function allows it to be used both for analysis by province and by cluster.
def concat_group(groups, group_name: str):
    # Create an empty list to store modified dataframes
    dfs = []
    # Iterate through the dictionary values and group labels
    for df, group in zip(datasets, groups):
        # Add a group membership column
        df[group_name] = group
        # Append the modified dataframe to the list
        dfs.append(df)
    # Concatenate all dataframes in the list along the rows
    full_df = pd.concat(dfs, ignore_index=True)
    return full_df
Visualization¶
Here we show visualizations of the smoothed series.
The function relies on the output of the concatenation step to produce plots by group.
The plots show that the meteorological variables (TG, TN, TX, HU, PP, QQ, RR) are highly similar across provinces, so we excluded them from clustering, given that the problem is already high-dimensional.
def plot_group(subfolder: str, vars, group: str, full_df):
    # subfolder: name of the subfolder inside the figures folder
    # vars: list of variable names, i.e. pollutants, met, or numerics
    # group: the variable to group by (e.g. province, cluster, ...)
    # full_df must have been created beforehand with the specific group variable
    sns.set(style="whitegrid")
    figures = R'figures'
    figures_folder = os.path.join(path, figures)
    output_folder = os.path.join(figures_folder, subfolder)
    os.makedirs(output_folder, exist_ok=True)
    for var in vars:
        # Plot the time series for each group
        plt.figure(figsize=(12, 6))
        sns.lineplot(x='date', y=var, hue=full_df[group], data=full_df)
        plt.title(f'{var} Over Time by {group}')
        plt.xlabel('Date')
        plt.ylabel(var)
        plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
        save_path = os.path.join(output_folder, f'{var}.png')
        plt.savefig(save_path, bbox_inches='tight')
        plt.show()
full_df = concat_group(provinces, 'province')
plot_group('smooth_series', numerics, 'province', full_df)
Time Series K-Means¶
In this section we implemented a variation of KMeans suited to clustering time series data (Time Series KMeans).
The main difference from the traditional KMeans algorithm is that it requires a distance metric other than the Euclidean distance: on time series, Euclidean distance can introduce distortions and fail to represent the distance between series correctly.
The most common distance metric for time series is Dynamic Time Warping (DTW).
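To make the intuition concrete, here is a minimal pure-NumPy sketch of the DTW recurrence (the actual clustering below uses tslearn's implementation): two series that trace the same shape but are shifted in time get a DTW distance of zero, while a pointwise comparison would penalize the shift, and Euclidean distance is not even defined when the lengths differ.

```python
import numpy as np

def dtw_distance(a, b):
    """Minimal DTW for 1-D series: cumulative cost with the classic recurrence."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # each cell extends the cheapest of the three admissible moves
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# The same peak, but s2 is delayed by one step (and one step longer)
s1 = [0.0, 1.0, 2.0, 1.0, 0.0]
s2 = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
print(dtw_distance(s1, s2))  # 0.0: warping aligns the shift away
```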
# import the required packages
from tslearn.clustering import TimeSeriesKMeans, silhouette_score
from sklearn.preprocessing import MinMaxScaler
Data preparation for clustering¶
First, we normalized the data, a necessary step before applying any clustering algorithm to ensure distances are not distorted.
for df in datasets:
    df[pollutants] = MinMaxScaler().fit_transform(df[pollutants])
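Min-max scaling maps each column onto [0, 1]; a quick numeric check of the formula the scaler applies, on toy values:

```python
import numpy as np

# One toy column of readings (hypothetical values)
x = np.array([2.0, 4.0, 6.0, 10.0])

# The min-max formula: (x - min) / (max - min)
scaled = (x - x.min()) / (x.max() - x.min())
print(scaled)  # [0.   0.25 0.5  1.  ]
```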
Data structure¶
The package used (tslearn) requires a 3D array as input, of shape (n_series, n_timesteps, n_features), where each slice along the first axis is a specific province.
# Prepare input data
X = []
for df in datasets:
    group_data = df[pollutants].values
    group_data = np.expand_dims(group_data, axis=0)
    X.append(group_data)
# Stack the list of arrays to create a 3D array
X = np.vstack(X)
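The stacking pattern above yields shape (n_provinces, n_timesteps, n_pollutants); a standalone check of the pattern with toy arrays:

```python
import numpy as np

# Two toy "provinces", each with 4 timesteps and 3 features
frames = [np.ones((4, 3)), np.zeros((4, 3))]

# expand_dims adds a leading axis of length 1; vstack stacks along it
X = np.vstack([np.expand_dims(a, axis=0) for a in frames])
print(X.shape)  # (2, 4, 3)
```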
Optimal number of clusters¶
The first step of KMeans is determining the optimal number of clusters (k).
This is done by running the algorithm for various choices of k and extracting 2 common evaluation metrics for each result:
- Inertia (sum of within-cluster distances)
- Silhouette score
inertia = []
silhouette = []
K = list(range(2, 9))
for k in K:
    km = TimeSeriesKMeans(n_clusters=k, n_init=5, metric='dtw', random_state=0)
    labels = km.fit_predict(X)
    silhouette_avg = silhouette_score(X, labels, metric='dtw')
    inertia.append(km.inertia_)
    silhouette.append(silhouette_avg)
Optimal number of clusters¶
In the inertia plot it is difficult to see for which k the rate of decrease changes significantly. By contrast, the silhouette score is clearly highest for 4 clusters.
# Plotting inertia vs silhouette
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.plot(K, inertia, marker='o')
plt.title('Inertia vs. Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.subplot(1, 2, 2)
plt.plot(K, silhouette, marker='o')
plt.title('Silhouette Score vs. Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.tight_layout()
figures_folder = os.path.join(path, R'figures')
plt.savefig(os.path.join(figures_folder, 'inertia_silhouette.png'))
plt.show()
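The same conclusion can be read off programmatically by taking the k that maximizes the silhouette score; a small sketch with illustrative values (not the actual scores from this run):

```python
import numpy as np

# Hypothetical silhouette scores for k = 2..8, mirroring the loop above
K = list(range(2, 9))
silhouette = [0.21, 0.28, 0.41, 0.33, 0.30, 0.27, 0.25]  # illustrative only

# Pick the k with the highest silhouette score
best_k = K[int(np.argmax(silhouette))]
print(best_k)  # 4
```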
KMeans with 4 clusters¶
km4 = TimeSeriesKMeans(n_clusters=4, n_init=5, metric='dtw', random_state=0)
clusters = km4.fit_predict(X)
def table_clusters(provinces=provinces, clusters=clusters):
    # Specify the results folder
    results = R"results"
    results_path = os.path.join(path, results)
    # Create a DataFrame pairing each province with its cluster label
    cluster_df = pd.DataFrame(list(zip(provinces, clusters)), columns=['Province', 'Cluster'])
    # Create a figure and axis
    fig, ax = plt.subplots(figsize=(5, 5))
    # Hide the axes
    ax.axis('off')
    # Create a table and add it to the axis
    table = ax.table(cellText=cluster_df.values, colLabels=cluster_df.columns,
                     cellLoc='center', loc='center', edges='open')
    # Make column labels bold
    for (i, j), cell in table.get_celld().items():
        if i == 0:
            cell.set_text_props(fontweight='bold')
    # Adjust column width
    table.auto_set_column_width([0, 1])
    # Adjust font size and table scale
    table.set_fontsize(13)
    table.scale(1, 2)
    save_file = os.path.join(results_path, 'cluster_provinces.png')
    # Save the figure as a PNG image in the specified folder
    fig.savefig(save_file, bbox_inches='tight', pad_inches=0.5, dpi=300)
table_clusters()
Provinces and clusters¶
Summary statistics of clusters¶
Here we show tables containing statistics for the various clusters.
df_cluster = concat_group(clusters, 'cluster')
def summary(full_df):
    grouped_df = full_df[['cluster'] + pollutants].groupby('cluster')
    summary_stats = grouped_df.describe()
    # build the output path portably instead of hard-coding the separator
    stats_path = os.path.join(path, 'results', 'statistics')
    os.makedirs(stats_path, exist_ok=True)
    for variable in pollutants:
        selected_stats = summary_stats[variable][['count', 'mean']]
        selected_stats = selected_stats.reset_index()
        # relabel the cluster column with values 0, 1, 2, 3
        selected_stats['cluster'] = [0, 1, 2, 3]
        # round the mean to 4 digits
        selected_stats['mean'] = round(selected_stats['mean'], 4)
        # ensure integers do not show decimal points by converting them to strings
        selected_stats['count'] = selected_stats['count'].astype(int).astype(str)
        selected_stats['cluster'] = selected_stats['cluster'].astype(str)
        # Move 'cluster' column to the first position
        selected_stats = selected_stats[['cluster', 'count', 'mean']]
        # Plotting parameters
        fig, ax = plt.subplots(figsize=(6, 4))
        ax.axis('off')  # Turn off axis
        cell_data = [selected_stats.columns] + selected_stats.values.tolist()
        table = ax.table(cellText=cell_data, loc='center', cellLoc='center', colLabels=None, edges='open')
        # Adjust column width
        table.auto_set_column_width([0, 1])
        # Adjust font size and table scale
        table.set_fontsize(15)
        table.scale(1, 1.5)
        # Add a title
        ax.set_title(f"Summary Statistics for {variable}", fontsize=18, y=0.9)
        # Make the header row bold
        for (i, j), cell in table.get_celld().items():
            if i == 0:
                cell.set_text_props(fontweight='bold')
        # Save the plot as an image
        save_name = os.path.join(stats_path, f'{variable}_table.png')
        plt.savefig(save_name, bbox_inches='tight', pad_inches=0.2)
        plt.show()
summary(df_cluster)
Conclusions¶
- Time Series KMeans with Dynamic Time Warping identified 4 clusters among the provinces, but the results are difficult to interpret
- Many variables have similar statistics across clusters, which means the clusters themselves are not extremely different
- Some clusters nevertheless differ significantly on specific variables:
  - NH3: clusters 0-1
  - NMVOC: clusters 0-3
  - O3: clusters 2-3
  - PM2.5: cluster 3 vs clusters 0/2
- Plots by province also show generally similar levels of the polluting chemicals across provinces
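One way to probe which variables actually separate the clusters is to compare the spread of cluster means per pollutant: the larger the range of means, the more the clusters differ on that variable. A minimal sketch on toy data (in practice `df_cluster` and the full pollutant list would be used):

```python
import pandas as pd

# Toy concatenated frame with a 'cluster' column, standing in for df_cluster
toy = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2, 2, 3, 3],
    "NH3": [0.1, 0.2, 0.8, 0.9, 0.4, 0.5, 0.45, 0.5],  # hypothetical scaled values
})

# Range of cluster means: a large spread means the clusters separate on NH3
means = toy.groupby("cluster")["NH3"].mean()
spread = means.max() - means.min()
print(round(spread, 3))  # 0.7
```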